On Custom Partitioner Classes in MapReduce (Part 4)
The MapTask class
In the MapTask class, find the run method:
if (useNewApi) {
  runNewMapper(job, splitMetaInfo, umbilical, reporter);
}
Then find runNewMapper:
@SuppressWarnings("unchecked")
private <INKEY,INVALUE,OUTKEY,OUTVALUE>
void runNewMapper(final JobConf job,
                  final TaskSplitIndex splitIndex,
                  final TaskUmbilicalProtocol umbilical,
                  TaskReporter reporter
                  ) throws IOException, ClassNotFoundException,
                           InterruptedException {
  // make a task context so we can get the classes
  org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
    new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job,
                                                                getTaskID(),
                                                                reporter);
  // make a mapper
  org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
    (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
      ReflectionUtils.newInstance(taskContext.getMapperClass(), job);
  // make the input format
  org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
    (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
      ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
  // rebuild the input split
  org.apache.hadoop.mapreduce.InputSplit split = null;
  split = getSplitDetails(new Path(splitIndex.getSplitLocation()),
      splitIndex.getStartOffset());
  LOG.info("Processing split: " + split);

  org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
    new NewTrackingRecordReader<INKEY,INVALUE>
      (split, inputFormat, reporter, taskContext);

  job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
  org.apache.hadoop.mapreduce.RecordWriter output = null;

  // get an output object
  if (job.getNumReduceTasks() == 0) {
    // if the number of reduce tasks is 0, this branch runs
    output =
      new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
  } else {
    // if the number of reduce tasks is greater than 0, this branch runs
    output = new NewOutputCollector(taskContext, job, umbilical, reporter);
  }

  org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE>
  mapContext =
    new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(),
        input, output,
        committer,
        reporter, split);

  org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
  mapperContext =
    new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(
        mapContext);

  try {
    input.initialize(split, mapperContext);
    mapper.run(mapperContext);
    mapPhase.complete();
    setPhase(TaskStatus.Phase.SORT);
    statusUpdate(umbilical);
    input.close();
    input = null;
    output.close(mapperContext);
    output = null;
  } finally {
    closeQuietly(input);
    closeQuietly(output, mapperContext);
  }
}
We know that partitioning is done when the map function emits its output, so the part that matters here is "get an output object":
// get an output object
if (job.getNumReduceTasks() == 0) {
  // if the number of reduce tasks is 0, this branch runs
  output =
    new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
} else {
  // if the number of reduce tasks is greater than 0, this branch runs
  output = new NewOutputCollector(taskContext, job, umbilical, reporter);
}
If there are no reduce tasks, a new NewDirectOutputCollector() is created (I have not explored the collection process yet).
If there are reduce tasks, a new NewOutputCollector() is created.
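As a side note, which branch runs is decided entirely by the reducer count configured in the driver. A minimal sketch of that driver side (the class and job names are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CollectorChoiceDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "collector-choice");
    // 0 reducers: runNewMapper picks NewDirectOutputCollector, so the map
    // output is written straight out and no partitioning or sorting happens.
    job.setNumReduceTasks(0);
    // Any value >= 1 makes runNewMapper build a NewOutputCollector instead,
    // which is where the partitioner comes into play.
    // job.setNumReduceTasks(3);
  }
}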
The inner class NewOutputCollector
In the inner class NewOutputCollector, find this method (the constructor):
NewOutputCollector(org.apache.hadoop.mapreduce.JobContext jobContext,
                   JobConf job,
                   TaskUmbilicalProtocol umbilical,
                   TaskReporter reporter
                   ) throws IOException, ClassNotFoundException {
  collector = createSortingCollector(job, reporter);
  partitions = jobContext.getNumReduceTasks();
  if (partitions > 1) {
    partitioner = (org.apache.hadoop.mapreduce.Partitioner<K,V>)
      ReflectionUtils.newInstance(jobContext.getPartitionerClass(), job);
  } else {
    partitioner = new org.apache.hadoop.mapreduce.Partitioner<K,V>() {
      @Override
      public int getPartition(K key, V value, int numPartitions) {
        return partitions - 1;
      }
    };
  }
}
The statement partitions = jobContext.getNumReduceTasks(); obtains the number of reduce tasks.
If the number of reduce tasks is less than or equal to 1, an anonymous Partitioner object is created whose overridden getPartition method always returns partitions - 1, i.e. 0, so everything ends up in a single partition.
If the number of reduce tasks is greater than 1, the class obtained from jobContext.getPartitionerClass() is instantiated via reflection.
So let's go and look:
The JobContext interface
In the JobContext interface we find:
/**
* Get the {@link Partitioner} class for the job.
*
* @return the {@link Partitioner} class for the job.
*/
public Class<? extends Partitioner<?,?>> getPartitionerClass()
throws ClassNotFoundException;
Let's look at its implementation class, JobContextImpl.
The JobContextImpl class
Note that it is the one under the mapreduce package, not the mapred package.
/**
* Get the {@link Partitioner} class for the job.
*
* @return the {@link Partitioner} class for the job.
*/
@SuppressWarnings("unchecked")
public Class<? extends Partitioner<?,?>> getPartitionerClass()
    throws ClassNotFoundException {
  return (Class<? extends Partitioner<?,?>>)
    conf.getClass(PARTITIONER_CLASS_ATTR, HashPartitioner.class);
}
conf.getClass(PARTITIONER_CLASS_ATTR, HashPartitioner.class);
means: look up the class stored under the PARTITIONER_CLASS_ATTR property and return it; if the property is not set, fall back to the default, HashPartitioner.class.
In other words, when the number of reduces is greater than 1 and no partitioner has been configured, HashPartitioner is used by default.
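As a quick sanity check of that fallback (this is my own sketch, not part of the MapTask walkthrough; it assumes the PARTITIONER_CLASS_ATTR constant exposed by MRJobConfig):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.MRJobConfig;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitionerDefaultCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // No partitioner has been configured, so conf.getClass returns the default:
    Class<?> cls = conf.getClass(MRJobConfig.PARTITIONER_CLASS_ATTR, HashPartitioner.class);
    System.out.println(cls.getName()); // ...lib.partition.HashPartitioner
  }
}

HashPartitioner itself looks like this: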
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

}
So HashPartitioner's getPartition method ultimately relies on the key object's hashCode method.
The hashCode we generated with Eclipse (Alt+Shift+S, then H) takes both fields into account (account and amount).
Sure enough, modifying hashCode in the custom key class works; I tested it, and it is fine as long as hashCode is generated only from our account field:
@Override
public int hashCode() {
  final int prime = 31;
  int result = 1;
  result = prime * result + ((account == null) ? 0 : account.hashCode());
  // result = prime * result + ((amount == null) ? 0 : amount.hashCode());
  return result;
}
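To see the effect, here is a small standalone sketch (the field names and sample values are made up for illustration) that feeds HashPartitioner's formula with a hashCode built from both fields versus one built from the account alone:

public class PartitionDemo {

  // Mimics an Eclipse-generated hashCode that mixes in both fields.
  static int hashBoth(String account, double amount) {
    final int prime = 31;
    int result = 1;
    result = prime * result + account.hashCode();
    result = prime * result + Double.valueOf(amount).hashCode();
    return result;
  }

  // hashCode derived from the account only, as in the fix above.
  static int hashAccountOnly(String account) {
    return 31 * 1 + account.hashCode();
  }

  // Same formula as HashPartitioner.getPartition.
  static int partition(int hash, int numReduceTasks) {
    return (hash & Integer.MAX_VALUE) % numReduceTasks;
  }

  public static void main(String[] args) {
    int reduces = 3;
    // Same account, different amounts: the two partitions may disagree.
    System.out.println(partition(hashBoth("S1", 100.0), reduces));
    System.out.println(partition(hashBoth("S1", 250.5), reduces));
    // Account-only hashCode: always the same partition for the same account.
    System.out.println(partition(hashAccountOnly("S1"), reduces));
    System.out.println(partition(hashAccountOnly("S1"), reduces));
  }
}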
Another, more mainstream approach:
Why should the custom Partitioner class be an inner class of Group? I changed it to a top-level class and tested it myself; it works perfectly well.
Its concrete form:
public static class KeyPartitioner extends Partitioner<SelfKey, DoubleWritable> {
  @Override
  public int getPartition(SelfKey key, DoubleWritable value, int numPartitions) {
    /**
     * To keep the output ordered as a whole we need our own business logic here,
     * and we have to know the number of reduce tasks in advance.
     * \w matches a word character: [a-zA-Z_0-9]
     */
    String account = key.getAccount();
    // accounts look like 0xxaaabbb, ending in a digit 0-9
    // last digit [0-2] -> partition 0, [3-6] -> partition 1, [7-9] -> partition 2
    if (account.matches("\\w*[0-2]")) {
      return 0;
    } else if (account.matches("\\w*[3-6]")) {
      return 1;
    } else if (account.matches("\\w*[7-9]")) {
      return 2;
    }
    return 0;
  }
}
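To actually use this partitioner, the driver has to register it and request a matching number of reducers (three here, one per branch above). A rough sketch of that job setup, assuming KeyPartitioner is now a top-level class and using placeholder names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class GroupDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "group-by-account");
    // Stored under PARTITIONER_CLASS_ATTR and read back later by getPartitionerClass().
    job.setPartitionerClass(KeyPartitioner.class);
    // One reduce task per partition value returned by KeyPartitioner.
    job.setNumReduceTasks(3);
    // ... mapper/reducer/input/output setup omitted ...
  }
}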
This is to guarantee that all of S1's records (and likewise S2's) end up in the same partition, rather than some of S1's records landing in one partition and the rest in another.
Because our key at this point is account + amount, records that all belong to account S1 could otherwise be assigned to different partitions, which would leave the final ordering a mess.
Across the boundless sea, here comes the lemming!